What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities

Jo, Nathanael, Wilson, Ashia

arXiv.org Artificial Intelligence

Evaluations of generative models on benchmark data are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet growing skepticism surrounds their reliability. How can we know that a reported accuracy genuinely reflects a model's true performance? Evaluations are often presented as simple measurements, but in reality they are inferences: to treat benchmark scores as evidence of capability is already to assume a theory of what capability is and how it manifests in a test. We make this step explicit by proposing a principled framework for evaluation as inference: begin from a theory of capability, and then derive methods for estimating it. This perspective, familiar in fields such as psychometrics, has not yet become commonplace in AI evaluation. As a proof of concept, we address a central challenge that undermines reliability: sensitivity to perturbations. After formulating a model of ability, we introduce methods that infer ability while accounting for uncertainty from sensitivity and finite samples, including an adaptive algorithm that significantly reduces sample complexity. Together, these contributions lay the groundwork for more reliable and trustworthy estimates of AI capabilities as measured through benchmarks.
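The abstract does not give implementation details, but one way to picture the kind of inference it describes is the sketch below; the function name, the perturbation setup, and the use of a bootstrap are illustrative assumptions, not the authors' method. Ability is taken to be the mean per-item accuracy across perturbed variants of each benchmark item, and a bootstrap interval reflects both perturbation sensitivity and finite-sample uncertainty.

    import numpy as np

    def estimate_ability(outcomes, n_boot=2000, alpha=0.05, seed=0):
        """Hypothetical sketch: outcomes[i][j] is 1 if the model answered
        perturbation j of benchmark item i correctly, else 0. Ability is
        the mean per-item accuracy; the bootstrap resamples items to
        reflect finite-sample and perturbation-sensitivity uncertainty."""
        rng = np.random.default_rng(seed)
        per_item = np.array([np.mean(o) for o in outcomes])  # accuracy per item over its perturbations
        point = per_item.mean()
        boots = [rng.choice(per_item, size=len(per_item), replace=True).mean()
                 for _ in range(n_boot)]
        lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
        return point, (lo, hi)

    # Example: three items, each scored under a handful of paraphrase perturbations.
    outcomes = [[1, 1, 0], [0, 0, 1, 0], [1, 1, 1]]
    print(estimate_ability(outcomes))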


A novel interface for adversarial trivia question-writing

Liu, Jason

arXiv.org Artificial Intelligence

A critical component when developing question-answering AIs is an adversarial dataset that challenges models to adapt to the complex syntax and reasoning underlying our natural language. Present techniques for procedurally generating adversarial texts are not robust enough for training on complex tasks such as answering multi-sentence trivia questions. We instead turn to human-generated data by introducing an interface for collecting adversarial human-written trivia questions. Our interface is aimed towards question writers and players of Quiz Bowl, a buzzer-based trivia competition where paragraph-long questions consist of a sequence of clues of decreasing difficulty. To incentivize usage, a suite of machine learning-based tools in our interface assist humans in writing questions that are more challenging to answer for Quiz Bowl players and computers alike. Not only does our interface gather training data for the groundbreaking Quiz Bowl AI project QANTA, but it is also a proof-of-concept of future adversarial data collection for question-answering systems. The results of performance-testing our interface with ten originally-composed questions indicate that, despite some flaws, our interface's novel question-writing features as well as its real-time exposure of useful responses from our machine models could facilitate and enhance the collection of adversarial questions. The code for our interface is available at: https://github.com/Zefan-Cai/QAML


The Complete Collection of Data Science Interviews – Part 1 - KDnuggets

#artificialintelligence

Have you ever been in a situation where the interviewer asked you a situational or technical question and you froze, simply because you were not prepared for it? It happens to many people, including me. I tend to freeze during technical interviews, and hiring managers take it as a weakness and reject me at the initial stage of the recruitment process. To overcome this problem, I started looking at sample interview questions.


British doctors go on the defensive due to 'high-performing' 'GP at Hand' app

The Japan Times

LONDON – A medical chatbot said to perform as well as or even better than human doctors has sparked a war of words in Britain over how much the cash-strapped public health service should rely on artificial intelligence. AI company Babylon, which is already working with the National Health Service, claimed its chatbot scored higher marks than real, live doctors in "robust tests." The British firm said it quizzed the AI using sample questions from trainee exams set by Britain's Royal College of General Practitioners (RCGP), the professional body for family doctors. The chatbot, a key feature of Babylon's "GP at Hand" app, scored 81 percent when sitting the test for the first time, while the average pass mark for doctors over the past five years was 72 percent, according to the company. Ali Parsa, the company's founder, who presented the findings in London earlier this week, hailed the results as "a landmark." "(They) take humanity a significant step closer to achieving a world where no one is denied safe and accurate health advice," he said in a statement.


ConferenceCall 2017 04 05 - OntologPSMW

#artificialintelligence

A fast-performing, dictionary-based tagger [1] constitutes EXTRACT's core. The tagger relies on a set of dictionaries that map biological names to corresponding terms in biological ontologies, or to pertinent records in public biological databases.
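The excerpt does not include code, but a minimal sketch of a dictionary-based tagger of this general kind might look as follows; the dictionary entries, identifiers, and function name below are invented for illustration and are not EXTRACT's actual data or interface.

    import re

    # Hypothetical dictionary: surface names mapped to ontology/database identifiers.
    DICTIONARY = {
        "p53": "HypotheticalGeneDB:TP53",
        "escherichia coli": "NCBITaxon:562",
        "glucose": "CHEBI:17234",
    }

    def tag(text, dictionary=DICTIONARY):
        """Return (matched name, identifier, start, end) for every dictionary hit,
        using case-insensitive whole-word matching."""
        hits = []
        for name, ident in dictionary.items():
            pattern = r"\b" + re.escape(name) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((m.group(0), ident, m.start(), m.end()))
        return sorted(hits, key=lambda h: h[2])

    print(tag("Expression of p53 was measured in Escherichia coli grown on glucose."))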


Could YOU pass the secretive Oxford entrance exam? University reveals some of its most common questions - and how to answer them

Daily Mail - Science & tech

It's a question you might never have considered before – why do older siblings do better on IQ tests than their younger counterparts? But if you want to get into Oxford's experimental psychology program, you'd better be prepared to answer. The university has released a series of questions from tutors who conduct the infamous interviews, revealing the complex problems in everything from mathematics to medicine used to spot the sharpest candidates. Oxford has revealed five interview questions spanning Modern Languages, Medicine, Philosophy, Maths, and Experimental Psychology.